Group Task

Authors

Marco Boso 100535153

Diego Paroli 100554973

Yijia Lin 100452242

Bradley McKenzie 100535241

Linghan Zheng 100540803

Jia Lin 100536210

Isabel Monge 100542532

Objectives and mandatory items

The objective of the delivery is to perform an analysis of the electoral data, carrying out the debugging, summaries and graphs you consider, both of their results and the accuracy of the electoral polls.

Specifically, you must work only in the time window that includes the elections from 2008 to the last elections of 2019.

General comments

In addition to what you see fit to execute, the following items are mandatory:

  • Each group should present before 9th January (23:59) an analysis of the data in .qmd and .html format in Quarto slides mode, which will be the ones they will present on the day of the presentation.

  • Quarto slides should be uploaded to Github (the link should be provided by a member of each group).

  • The maximum number of slides should be 40. The maximum time for each group will be 20-22 minutes (+5 minutes for questions).

  • During the presentation you will explain (summarised!) the analysis performed so that each team member speaks for a similar amount of time and each member can be asked about any of the steps. The grade does not have to be the same for all members.

  • It will be valued not only the content but also the container (aesthetics).

  • The objective is to demonstrate that the maximum knowledge of the course has been acquired: the more content of the syllabus is included, the better.

Mandatory items:

  1. Data should be converted to tidydata where appropriate.

  2. You should include at least one join between tables.

  3. Reminder: information = variance, so remove columns that are not going to contribute anything.

  4. The glue and lubridate packages should be used at some point, as well as the forcats. The use of ggplot2 will be highly valued.

  5. The following should be used at least once:

    • mutate
    • summarise
    • group_by (or equivalent)
    • case_when
  6. We have many, many parties running for election. We will only be interested in the following parties:

    • PARTIDO SOCIALISTA OBRERO ESPAÑOL (beware: it has/had federations - branches - with some other name).
    • PARTIDO POPULAR
    • CIUDADANOS (caution: has/had federations - branches - with some other name)
    • PARTIDO NACIONALISTA VASCO
    • BLOQUE NACIONALISTA GALLEGO
    • CONVERGÈNCIA I UNIÓ
    • UNIDAS PODEMOS - IU (beware that here they have had various names - IU, podem, ezker batua, …- and have not always gone together, but here we will analyze them together)
    • ESQUERRA REPUBLICANA DE CATALUNYA
    • EH - BILDU (are now a coalition of parties formed by Sortu, Eusko Alkartasuna, Aralar, Alternatiba)
    • MÁS PAÍS
    • VOX
  7. Anything other than any of the above parties should be imputed as “OTHER”. Remember to add properly the data after the previous recoding.

  8. Party acronyms will be used for the visualizations. The inclusion of graphics will be highly valued (see https://r-graph-gallery.com/).

  9. You must use all 4 data files at some point.

  10. You must define at least one (non-trivial) function of your own.

  11. You will have to discard mandatory polls that:

-   refer to elections before 2008
-   that are exit polls
-   have a sample size of less than 750 or are unknown
-   that have less than 1 or less fieldwork days
  1. You must obligatorily answer the following questions (plus those that you consider analyzing to distinguish yourself from the rest of the teams, either numerically and/or graphically)
-   Which party was the winner in the municipalities with more than 100,000 habitants (census) in each of the elections?
-   Which party was the second when the first was the PSOE? And when the first was the PP?
-   Who benefits from low turnout?
-   How to analyze the relationship between census and vote? Is it true that certain parties win in rural areas?
-   How to calibrate the error of the polls (remember that the polls are voting intentions at national level)?
-   Which polling houses got it right the most and which ones deviated the most from the results?

You should include at least 3 more “original” questions that you think that it could be interesting to be answer with the data.

Marks

The one who does the most things will not be valued the most. More is not always better. The originality (with respect to the rest of the works, for example in the analyzed or in the subject or …) of what has been proposed, in the handling of tables (or in the visualization), the caring put in the delivery (care in life is important) and the relevance of what has been done will be valued. Once you have the mandatory items with your database more or less completed, think before chopping code: what could be interesting? What do I need to get a summary both numerical and visual?

Remember that the real goal is to demonstrate a mastery of the tools seen throughout the course. And that happens not only by the quantity of them used but also by the quality when executing them.

Some dataviz will be extremely positive valued.

Required packages

Insert in the lower chunk the packages you will need

rm(list = ls())
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(tidyr)
library(stringr)
library(lubridate)
library(DataExplorer)

Data

The practice will be based on the electoral data archives below, compiling data on elections to the Spanish Congress of Deputies from 2008 to the present, as well as surveys, municipalities codes and abbreviations.

# NO TOQUES NADA
election_data <- read_csv(file = "./data/datos_elecciones_brutos.csv")
Rows: 48737 Columns: 471
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (5): tipo_eleccion, mes, codigo_ccaa, codigo_provincia, codigo_municipio
dbl (424): anno, vuelta, codigo_distrito_electoral, numero_mesas, censo, par...
lgl  (42): FALANGE ESPAÑOLA DE LA JONS, PARTIDO COMUNISTA DEL PUEBLO CASTELL...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
cod_mun <- read_csv(file = "./data/cod_mun.csv")
Rows: 8135 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): cod_mun, municipio

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
surveys <- read_csv(file = "./data/historical_surveys.csv")
Rows: 3753 Columns: 59
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (4): type_survey, id_pollster, pollster, media
dbl  (51): size, turnout, UCD, PSOE, PCE, AP, CIU, PA, EAJ-PNV, HB, ERC, EE,...
lgl   (1): exit_poll
date  (3): date_elec, field_date_from, field_date_to

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
abbrev <- read_csv(file = "./data/siglas.csv")
Rows: 587 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): denominacion, siglas

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

The data will be as follows:

  • election_data: file with election data for Congress from 2018 to the last ones in 2019.

    • tipo_eleccion: type of election (02 if congressional election)
    • anno, mes: year and month of elections
    • vuelta: electoral round (1 if first round)
    • codigo_ccaa, codigo_provincia, codigo_municipio, codigo_distrito_electoral: code of the ccaa, province, municipality and electoral district.
    • numero_mesas: number of polling stations
    • censo: census
    • participacion_1, participacion_2: participation in the first preview (14:00) and second preview (18:00) before polls close (20:00)
    • votos_blancos: blank ballots
    • votos_candidaturas: party ballots
    • votos_nulos: null ballots
    • ballots for each party
  • cod_mun: file with the codes and names of each municipality

  • abbrev: acronyms and names associated with each party

  • surveys: table of electoral polls since 1982. Some of the variables are the following:

    • type_survey: type of survey (national, regional, etc.)
    • date_elec: date of future elections
    • id_pollster, pollster, media: id and name of the polling company, as well as the media that commissioned it.
    • field_date_from, field_date_to: start and end date of fieldwork
    • exit_poll: whether it is an exit poll or not
    • size: sample size
    • turnout: turnout estimate
    • estimated voting intentions for the main parties

Cleaning the data – surveys

# Filter dataset
cleaned_surveys <- surveys |>
  mutate(
    # Parse dates variables as date objects
    field_date_from = ymd(field_date_from),
    field_date_to = ymd(field_date_to),
    date_elec = ymd(date_elec),
    # Calculate the number of fieldwork days
    fieldwork_days = as.numeric(field_date_to - field_date_from + 1)
  ) |>
  filter(
    !exit_poll,                           # Exclude exit polls
    date_elec >= ymd("2008-01-01"),       # Exclude polls referred to elections before 2008
    size >= 750,                          # Exclude polls with sample size < 750
    fieldwork_days > 1                    # Exclude polls with 1 or fewer fieldwork days
  )

# Deleting columns that only have NAs
cleaned_surveys <- cleaned_surveys |> 
  select(where(~ !all(is.na(.))))

# Identify party columns dynamically
metadata_columns <- c("type_survey", "date_elec", "id_pollster", "pollster", "media",
                      "field_date_from", "field_date_to", "fieldwork_days", "exit_poll", 
                      "size", "turnout")
party_columns <- setdiff(colnames(cleaned_surveys), metadata_columns)

# Reshape data into long format
tidy_surveys <- cleaned_surveys |>
  pivot_longer(
    cols = all_of(party_columns),  # Reshape party columns
    names_to = "party_raw",        # Raw party names
    values_to = "votes"            # Corresponding voting intentions
  )

#Old version:
# tidy_surveys <- tidy_surveys |>
#   mutate(
#     party = case_when(
#       str_detect(party_raw, "(?i)PSOE|PARTIDO SOCIALISTA") ~ "PARTIDO SOCIALISTA OBRERO ESPAÑOL",
#       str_detect(party_raw, "(?i)CIUDADANOS|C’S") ~ "CIUDADANOS",
#       str_detect(party_raw, "(?i)PNV|EAJ-PNV") ~ "PARTIDO NACIONALISTA VASCO",
#       str_detect(party_raw, "(?i)BNG") ~ "BLOQUE NACIONALISTA GALLEGO",
#       str_detect(party_raw, "(?i)CIU|CONVERGÈNCIA|UNIÓ") ~ "CONVERGÈNCIA I UNIÓ",
#       str_detect(party_raw, "(?i)IU|PODEMOS|EZKER BATUA|UNIDAS PODEMOS") ~ "UNIDAS PODEMOS - IU",
#       str_detect(party_raw, "(?i)ERC|ESQUERRA") ~ "ESQUERRA REPUBLICANA DE CATALUNYA",
#       str_detect(party_raw, "(?i)SORTU|EUSKO ALKARTASUNA|ARALAR|ALTERNATIBA|EH BILDU") ~ "EH - BILDU",
#       str_detect(party_raw, "(?i)MÁS PAÍS") ~ "MÁS PAÍS",
#       str_detect(party_raw, "(?i)VOX") ~ "VOX",
#       TRUE ~ "OTHER"
#     )
#   )


# AMAIUR - EH - BILDU ???
# NA-SUMA - PARTIDO POPULAR??? CIUDADANOS???
tidy_surveys <- tidy_surveys %>%
  mutate(
    party = case_when(
      party_raw == "PSOE" ~ "PARTIDO SOCIALISTA OBRERO ESPAÑOL",
      party_raw == "CIU" ~ "CONVERGÈNCIA I UNIÓ",
      party_raw == "EAJ-PNV" ~ "PARTIDO NACIONALISTA VASCO",
      party_raw == "ERC" ~ "ESQUERRA REPUBLICANA DE CATALUNYA",
      party_raw == "IU" ~ "UNIDAS PODEMOS - IU",
      party_raw == "PP" ~ "PARTIDO POPULAR",
      party_raw == "BNG" ~ "BLOQUE NACIONALISTA GALLEGO",
      party_raw == "CS" ~ "CIUDADANOS",
      party_raw == "EH-BILDU" ~ "EH - BILDU",
      party_raw == "PODEMOS" ~ "UNIDAS PODEMOS - IU",
      party_raw == "VOX" ~ "VOX",
      party_raw == "MP" ~ "MÁS PAÍS",
      TRUE ~ "OTHER")
  )

# Create a column for proper, unqique acronyms
tidy_surveys <- tidy_surveys |> 
  mutate(
    party_code = case_when(
      party == "UNIDAS PODEMOS - IU"~ "PODEMOS-IU",
      party == "OTHER"~ "OTHER",
      TRUE ~ party_raw)
  )

# Select relevant columns
# Getting rid of type_survey, exit_poll (take only 1 value), party_raw
# Do we want to get rid of media as well? Or can it be useful?
final_surveys <- tidy_surveys |>
  select(-type_survey, -exit_poll, -party_raw) |> 
  relocate(fieldwork_days, .after = field_date_to) |> 
  relocate(votes, .after = party_code) 

# Summing all votes based on the party reclassification
final_surveys <- final_surveys |> 
  group_by(across(-votes)) |> 
  summarize(votes = sum(votes, na.rm = TRUE), .groups = "drop") |> 
  arrange(field_date_from)
# We have 1614 surveys (rows from cleaned_surveys), 12 parties (meaning 12 rows per survey). Thus 1614x12=19368 rows

# Can we get rid of some rows? ex. the ones where votes == 0? Or would that be detrimental

# Preview
tidy_surveys
# A tibble: 40,350 × 15
   type_survey date_elec  id_pollster pollster media             field_date_from
   <chr>       <date>     <chr>       <chr>    <chr>             <date>         
 1 national    2008-03-09 pollster-49 GESOP    EL PERIÓDIC ANDO… 2008-03-07     
 2 national    2008-03-09 pollster-49 GESOP    EL PERIÓDIC ANDO… 2008-03-07     
 3 national    2008-03-09 pollster-49 GESOP    EL PERIÓDIC ANDO… 2008-03-07     
 4 national    2008-03-09 pollster-49 GESOP    EL PERIÓDIC ANDO… 2008-03-07     
 5 national    2008-03-09 pollster-49 GESOP    EL PERIÓDIC ANDO… 2008-03-07     
 6 national    2008-03-09 pollster-49 GESOP    EL PERIÓDIC ANDO… 2008-03-07     
 7 national    2008-03-09 pollster-49 GESOP    EL PERIÓDIC ANDO… 2008-03-07     
 8 national    2008-03-09 pollster-49 GESOP    EL PERIÓDIC ANDO… 2008-03-07     
 9 national    2008-03-09 pollster-49 GESOP    EL PERIÓDIC ANDO… 2008-03-07     
10 national    2008-03-09 pollster-49 GESOP    EL PERIÓDIC ANDO… 2008-03-07     
# ℹ 40,340 more rows
# ℹ 9 more variables: field_date_to <date>, exit_poll <lgl>, size <dbl>,
#   turnout <dbl>, fieldwork_days <dbl>, party_raw <chr>, votes <dbl>,
#   party <chr>, party_code <chr>

Creating table for party codes

Creating a table to link each party name to its unique code

party_info <- final_surveys |> 
  select(party, party_code) |> 
  unique()

Cleaning the data – election_data

The election_data file is large and requires quite extensive cleaning to make it “tidy”. We will tidy the data to try make it most useful for future analysis. The election data starts off with 48,737 rows and 471 columns. Reducing the number of columns is a clear priority.

First, we look at the quality of the data and see if any information is redundant and can be removed.

plot_intro(election_data)

# we see 1.9% missing colums, identify the cols with no data - we have 9 cols. 
blank_cols <- names(election_data)[sapply(election_data, function(x) all(is.na(x)))]

# drop these columns and also filter to ensure no info outside 2008 to 2019 is included. 
election_data <- election_data |> 
  select(-all_of(blank_cols)) |> 
  filter(anno >= 2008 & anno <= 2019)

# Drop columns that are logical
election_data <- election_data %>%
  select(where(~ !is.logical(.)))

Second, we begin to make the election data tidy. We start by pivoting the data so all columns for party names are within one “party” variable.

# Pivot all the party names and ballot counts to the main table
election_pivot <- election_data |> 
  pivot_longer(
    cols = `BERDEAK-LOS VERDES`:`COALICIÓN POR MELILLA`, # select all party data
    names_to = "party",
    values_to = "ballots"
  )
str(election_pivot)
tibble [20,177,118 × 17] (S3: tbl_df/tbl/data.frame)
 $ tipo_eleccion            : chr [1:20177118] "02" "02" "02" "02" ...
 $ anno                     : num [1:20177118] 2008 2008 2008 2008 2008 ...
 $ mes                      : chr [1:20177118] "03" "03" "03" "03" ...
 $ vuelta                   : num [1:20177118] 1 1 1 1 1 1 1 1 1 1 ...
 $ codigo_ccaa              : chr [1:20177118] "14" "14" "14" "14" ...
 $ codigo_provincia         : chr [1:20177118] "01" "01" "01" "01" ...
 $ codigo_municipio         : chr [1:20177118] "001" "001" "001" "001" ...
 $ codigo_distrito_electoral: num [1:20177118] 0 0 0 0 0 0 0 0 0 0 ...
 $ numero_mesas             : num [1:20177118] 2 2 2 2 2 2 2 2 2 2 ...
 $ censo                    : num [1:20177118] 1838 1838 1838 1838 1838 ...
 $ participacion_1          : num [1:20177118] 677 677 677 677 677 677 677 677 677 677 ...
 $ participacion_2          : num [1:20177118] 1008 1008 1008 1008 1008 ...
 $ votos_blancos            : num [1:20177118] 23 23 23 23 23 23 23 23 23 23 ...
 $ votos_nulos              : num [1:20177118] 13 13 13 13 13 13 13 13 13 13 ...
 $ votos_candidaturas       : num [1:20177118] 1269 1269 1269 1269 1269 ...
 $ party                    : chr [1:20177118] "BERDEAK-LOS VERDES" "ARALAR" "PARTIDO OBRERO SOCIALISTA INTERNACIONALISTA" "ALTERNATIVA MOTOR Y DEPORTES" ...
 $ ballots                  : num [1:20177118] 9 27 1 1 2 238 61 85 4 17 ...
head(election_pivot)
# A tibble: 6 × 17
  tipo_eleccion  anno mes   vuelta codigo_ccaa codigo_provincia codigo_municipio
  <chr>         <dbl> <chr>  <dbl> <chr>       <chr>            <chr>           
1 02             2008 03         1 14          01               001             
2 02             2008 03         1 14          01               001             
3 02             2008 03         1 14          01               001             
4 02             2008 03         1 14          01               001             
5 02             2008 03         1 14          01               001             
6 02             2008 03         1 14          01               001             
# ℹ 10 more variables: codigo_distrito_electoral <dbl>, numero_mesas <dbl>,
#   censo <dbl>, participacion_1 <dbl>, participacion_2 <dbl>,
#   votos_blancos <dbl>, votos_nulos <dbl>, votos_candidaturas <dbl>,
#   party <chr>, ballots <dbl>

We now have a table with 20,177,118 rows and 17 columns.

Things we can do: 1- Delete column tipo_eleccion and vuelta, codigo_distrito_electoral 2- Collapse year and month into one single column 3- Split the dataset in 2: a. one containing all the votes split by parties for every election, and every municipality (just like now but getting rid of the columns votos_blancos, voto_nullos, votos_candidaturas since they are coded at a different level) b. the other contains the column votos_blancos, voto_nullos, votos_candidaturas (which are coded as the total of those types of votes in that election in that particular municipality)

Some have been done below check afterwards

This is more clean than previously, but we still need to aggregate of our party variables into the main party groups. We will do this by creating a mapping table (party_names) that standardizes the raw party names into main party groupings (party_main) using regular expressions.

party_names <- tibble(names = unique(election_pivot$party))

# Party names in the election_data file do not match up perfectly with the abbrev file (i.e. some of the names present in party_names are not in abbrev)
# So it is better to work directly on party_names instead of using abbrev

party_names <- party_names |> 
    mutate(party_main = case_when(
                str_detect(names, "(?i)PSOE|PARTIDO DOS SOCIALISTAS DE GALICIA|PARTIDO SOCIALISTA DE EUSKADI|PARTIDO SOCIALISTA OBRERO ESPAÑOL| PARTIT SOCIALISTA OBRER ESPANYOL") ~ "PARTIDO SOCIALISTA OBRERO ESPAÑOL",
                str_detect(names, "(?i)PARTIDO POPULAR") ~ "PARTIDO POPULAR",
                str_detect(names, "(?i)CIUDADANOS-PARTIDO DE LA CIUDADANIA|CIUDADANOS-PARTIDO DE LA CIUDADANÍA|CIUDADANOS PARTIDO DE LA CIUDADANIA|CIUDADANOS PARTIDO DE LA CIUDADANÍA|CIUDADANOS, PARTIDO DE LA CIUDADANÍA|CIUTADANS") ~ "CIUDADANOS",
                str_detect(names, "(?i)EUZKO ALDERDI JELTZALEA-PARTIDO NACIONALISTA VASCO") ~ "PARTIDO NACIONALISTA VASCO",
                str_detect(names, "(?i)BLOQUE NACIONALISTA GALEGO|BNG") ~ "BLOQUE NACIONALISTA GALLEGO",
                str_detect(names, "(?i)CONVERGENCIA I UNIO|CONVERGÈNCIA i UNIÓ| 
CONVERGÈNCIA I UNIÓ") ~ "CONVERGÈNCIA I UNIÓ",
                str_detect(names, "(?i)PODEM|EZKER BATUA|EZKER ANITZA|IZQUIERDA UNIDA|ESQUERRA UNIDA|ESQUERDA UNIDA") ~ "UNIDAS PODEMOS - IU",
                str_detect(names, "(?i)ESQUERRA REPUBLICANA") ~ "ESQUERRA REPUBLICANA DE CATALUNYA",
                str_detect(names, "(?i)BILDU|EUSKO ALKARTASUNA|ARALAR|SORTU|ALTERNATIBA") ~ "EH - BILDU",
                str_detect(names, "(?i)MÁS PAÍS") ~ "MÁS PAÍS",
                str_detect(names, "(?i)VOX") ~ "VOX",
                TRUE ~ "OTHER")
    )

unique(party_names$party_main)
 [1] "OTHER"                             "EH - BILDU"                       
 [3] "PARTIDO POPULAR"                   "UNIDAS PODEMOS - IU"              
 [5] "PARTIDO SOCIALISTA OBRERO ESPAÑOL" "PARTIDO NACIONALISTA VASCO"       
 [7] "CIUDADANOS"                        "ESQUERRA REPUBLICANA DE CATALUNYA"
 [9] "CONVERGÈNCIA I UNIÓ"               "BLOQUE NACIONALISTA GALLEGO"      
[11] "VOX"                               "MÁS PAÍS"                         
# Adding party code to party_names dataframe
party_names <- party_names |> 
  left_join(party_info, by = join_by(party_main == party))

~~ Logic for creating a key for the parties and joining on the data

Joining on names and pivoting was very intensive for the machine. This approach to create a join table and left_join on the party names was selected for efficiency over these 2 other options:

  1. create rowwise summaries for each party group, then pivot_longer to get our summary data. The summing across columns should only call the specific columns includes, then add those values together (across the approx 49000 rows)

  2. If we pivot_longer first, then use case_when and str_detect to create our party_main variable directly on the pivot_longer data.

We do not include these results, but found that the join option was fastest, with pivot_longer then running the case when (b) next fastest, while the match and rowwise summaries across columns were incredibly slow.

We now have a table of the almost 22 million rows and 18 variables. ~~

# Join up all party characteristics to main and other
election_pivot <- election_pivot |> 
  left_join(party_names, by = join_by(party == names))

Now we will include some additional information that will make the analysis potentially easier later, including province and valid vote counts from our data:

# create municipal code to join on municipal names. 
election_pivot <- election_pivot |>
  mutate(cod_mun = paste(codigo_ccaa, codigo_provincia, codigo_municipio, sep="-"), # create municipio code to join
         valid_votes = votos_candidaturas, 
         invalid_votes = votos_blancos + votos_nulos,
         total_votes = valid_votes + invalid_votes)

# join municipality names and create vote count columns
election_pivot <- left_join(election_pivot, cod_mun, by = join_by(cod_mun))  

# check quality of the join and whether NA's have been introduced as municipality names
table(is.na(election_pivot$municipio))

   FALSE 
20177118 

Aggregate election ballot data to main party groups

Now we need to group together all of the votes for “OTHER” variables and create unique identifiers for each individual election in our dataframes.

Currently we have a table of 23 variables with 21,785,439 rows. We can clean this more.

First, identify the redundant data in our election. We can remove:

tipo_eleccion - because all values = 02. It is not useful vuelta = because all values = 1, it is not useful. geographic variables = we will remove codigo_municipio is included in cod_mun which we joined on from the cod_mun table. We keep the autonomous community and proivnce variables for potential future aggregation and analysis. codigo_distrito_electoral - because every value is zero. It is not useful. votos - we have created valid, invalid and total summary variables so will remove votos_blancos and votos_candidaturas and votos_nulos. We have grouped blancos and nulos together as they are deemed unuseful independently.

Notably, we have many NA ballot rows (21,388,704) and a row for each individual party at each election, where will also try to reduce this when we aggregate the party data with the “party_main” variable created.

summary(election_pivot)
 tipo_eleccion           anno          mes                vuelta 
 Length:20177118    Min.   :2008   Length:20177118    Min.   :1  
 Class :character   1st Qu.:2011   Class :character   1st Qu.:1  
 Mode  :character   Median :2016   Mode  :character   Median :1  
                    Mean   :2015                      Mean   :1  
                    3rd Qu.:2019                      3rd Qu.:1  
                    Max.   :2019                      Max.   :1  
                                                                 
 codigo_ccaa        codigo_provincia   codigo_municipio  
 Length:20177118    Length:20177118    Length:20177118   
 Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character  
                                                         
                                                         
                                                         
                                                         
 codigo_distrito_electoral  numero_mesas          censo        
 Min.   :0                 Min.   :   1.000   Min.   :      3  
 1st Qu.:0                 1st Qu.:   1.000   1st Qu.:    144  
 Median :0                 Median :   1.000   Median :    454  
 Mean   :0                 Mean   :   7.261   Mean   :   4249  
 3rd Qu.:0                 3rd Qu.:   3.000   3rd Qu.:   1858  
 Max.   :0                 Max.   :3742.000   Max.   :2384269  
                                                               
 participacion_1   participacion_2   votos_blancos       votos_nulos      
 Min.   :      0   Min.   :      0   Min.   :    0.00   Min.   :    0.00  
 1st Qu.:     57   1st Qu.:     86   1st Qu.:    1.00   1st Qu.:    1.00  
 Median :    185   Median :    278   Median :    3.00   Median :    4.00  
 Mean   :   1640   Mean   :   2448   Mean   :   28.71   Mean   :   29.84  
 3rd Qu.:    720   3rd Qu.:   1109   3rd Qu.:   12.00   3rd Qu.:   16.00  
 Max.   :1022073   Max.   :1531231   Max.   :17409.00   Max.   :16527.00  
                                                                          
 votos_candidaturas    party              ballots          party_main       
 Min.   :      1    Length:20177118    Min.   :     1     Length:20177118   
 1st Qu.:    106    Class :character   1st Qu.:     3     Class :character  
 Median :    336    Mode  :character   Median :    15     Mode  :character  
 Mean   :   3025                       Mean   :   372                       
 3rd Qu.:   1364                       3rd Qu.:    93                       
 Max.   :1847096                       Max.   :919701                       
                                       NA's   :19781159                     
  party_code          cod_mun           valid_votes      invalid_votes     
 Length:20177118    Length:20177118    Min.   :      1   Min.   :    0.00  
 Class :character   Class :character   1st Qu.:    106   1st Qu.:    2.00  
 Mode  :character   Mode  :character   Median :    336   Median :    7.00  
                                       Mean   :   3025   Mean   :   58.55  
                                       3rd Qu.:   1364   3rd Qu.:   28.00  
                                       Max.   :1847096   Max.   :33869.00  
                                                                           
  total_votes       municipio        
 Min.   :      2   Length:20177118   
 1st Qu.:    109   Class :character  
 Median :    343   Mode  :character  
 Mean   :   3084                     
 3rd Qu.:   1393                     
 Max.   :1872679                     
                                     

To clean the data more, reduce our dataset and rename key variables so everything is more consistent in English.

tidy_election <- election_pivot |> 
  select(year = anno, 
         month = mes,
         code_community = codigo_ccaa,
         code_province = codigo_provincia,
         code_municipality = cod_mun,
         municipality = municipio,
         total_votes,
         valid_votes,
         invalid_votes,
         number_tables = numero_mesas,
         population = censo,
         participation_1 = participacion_1,
         participation_2 = participacion_2,
         party,
         ballots,
         party_main)

Now we group by each individual election to summarise the votes to each of the main parties. We can then drop the party variable in favour of keeping party_main only. We also lose the ballots variable as it becomes party_ballots in our summarise function.

We identify each election in each province by keeping the descriptive data for each election. Then summarise the number of ballots to each main party group.

tidy_election <- tidy_election |> 
  group_by(year, 
         month, 
         code_community,
         code_province,
         code_municipality,
         municipality,
         total_votes,
         valid_votes,
         invalid_votes,
         number_tables,
         population,
         participation_1,
         participation_2,
         # party,   # not included as we want to group by party_main
         # ballots, # not included as we create our summary of votes by party_main
         party_main) |> 
  summarise(party_ballots = sum(ballots, na.rm=TRUE)) |> 
  ungroup()
`summarise()` has grouped output by 'year', 'month', 'code_community',
'code_province', 'code_municipality', 'municipality', 'total_votes',
'valid_votes', 'invalid_votes', 'number_tables', 'population',
'participation_1', 'participation_2'. You can override using the `.groups`
argument.
tidy_election
# A tibble: 584,844 × 15
    year month code_community code_province code_municipality municipality
   <dbl> <chr> <chr>          <chr>         <chr>             <chr>       
 1  2008 03    01             04            01-04-001         Abla        
 2  2008 03    01             04            01-04-001         Abla        
 3  2008 03    01             04            01-04-001         Abla        
 4  2008 03    01             04            01-04-001         Abla        
 5  2008 03    01             04            01-04-001         Abla        
 6  2008 03    01             04            01-04-001         Abla        
 7  2008 03    01             04            01-04-001         Abla        
 8  2008 03    01             04            01-04-001         Abla        
 9  2008 03    01             04            01-04-001         Abla        
10  2008 03    01             04            01-04-001         Abla        
# ℹ 584,834 more rows
# ℹ 9 more variables: total_votes <dbl>, valid_votes <dbl>,
#   invalid_votes <dbl>, number_tables <dbl>, population <dbl>,
#   participation_1 <dbl>, participation_2 <dbl>, party_main <chr>,
#   party_ballots <dbl>

We now have a tibble of 15 columns with 536,107 rows for analysis. This is much cleaner and faster than previous versions. Our current clean election_data table includes:

Election identifiers:

Timing -> year, month - Area information -> code_community (autonomous community), code_province, code_municipality, municipality, population - General election information -> total_votes, valid_votes, invalid_votes, number_tables, participacion_1, participacion_2 - Party votes received -> party_main, party_ballots

For Isabel, Marco and Brad post-cleaning, describe out data status after cleaning at this stage:

We have 2 primary datasets at this stage, election data and survey data.

The survey data includes:

The election data includes:

  • year
  • month
  • party name (with non-primary parties grouped in major)
  • votes received for the party

Mandatory questions

1.Which party was the winner in the municipalities with more than 100,000 habitants (census) in each of the elections?

2. Which party was the second when the first was the PSOE? And when the first was the PP?

3. Who benefits from low turnout?

4. How to analyze the relationship between census and vote? Is it true that certain parties win in rural areas?

5. How to calibrate the error of the polls (remember that the polls are voting intentions at national level)?

6. Which polling houses got it right the most and which ones deviated the most from the results?

Additional questions

SOME IDEAS FOR THE ORIGINAL QUESTIONS TO START?

  • Which regions had the most predictable votes (i.e. consistently voted for the same party) and which regions were the most undecided (i.e. had the most variance in there votes across) between the 2008 and 2019 elections?

  • Map the outcomes over time - plotly on the results? Not sure as I didn’t do a map

  • Can we load in other data? think that would go down well? Maybe predict the next election results based on previous trends of the 5 years and compare to see if the following election followed the trend? Think Javi would be happy with new data.

7. Jacklyn and Linghan

8. Yijia and Diego

7. Marco, Isabel, Brad